A domain sequence approach to pangenomics: applications to Escherichia coli

Authors

  • Lars-Gustav Snipen
  • David W Ussery
  • Frederic Bertels
  • Barry Wanner
Abstract

The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli, we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is necessary given the expected explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored.
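The core idea of grouping proteins by their ordered domain content can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input layout (a mapping from protein ID to a list of `(start, domain_name)` hits) and the example data are all assumptions made for illustration, and the domain names are placeholders rather than real Pfam-A results.

```python
from collections import defaultdict


def domain_sequence(hits):
    """Return the ordered tuple of domain names for one protein.

    `hits` is an iterable of (start_position, domain_name) pairs;
    this layout is an assumption, not a format from the paper.
    """
    return tuple(name for _, name in sorted(hits, key=lambda h: h[0]))


def cluster_by_domain_sequence(proteins):
    """Group protein IDs that share the same ordered domain sequence.

    Proteins without any domain hit are skipped, mirroring the fact
    that only proteins with hits can be assigned a domain sequence.
    """
    families = defaultdict(list)
    for protein_id, hits in proteins.items():
        if hits:
            families[domain_sequence(hits)].append(protein_id)
    return dict(families)


# Hypothetical example data: three proteins from two genomes.
example = {
    "genomeA|p1": [(120, "DomY"), (5, "DomX")],   # hits given out of order
    "genomeB|p7": [(8, "DomX"), (130, "DomY")],
    "genomeB|p9": [(10, "DomY")],
    "genomeB|p3": [],                              # no hit, ignored
}

families = cluster_by_domain_sequence(example)
for dom_seq, members in families.items():
    print(dom_seq, members)
```

Because each protein is assigned to a family through a single dictionary lookup on its domain tuple, the work in this sketch grows in proportion to the number of proteins, which is in line with the linear scaling in the number of genomes described in the abstract.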

Similar resources

A domain sequence approach to pangenomics: Applications to Escherichia coli [v1; ref status: Indexed, http://f1000r.es/QSnDE6]

The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to clus...

Functional motifs in Escherichia coli NC101

Escherichia coli (E. coli) bacteria can damage the DNA of gut lining cells and may encourage the development of colon cancer, according to recent reports. Genetic switches are specific sequence motifs, and many of them are drug targets. It is therefore of interest to identify motifs and their locations in sequences. In the present study, the Gibbs sampler algorithm was used in order to predict and find functional m...

Expression and Secretion of Human Granulocyte Macrophage-Colony Stimulating Factor Using Escherichia coli Enterotoxin I Signal Sequence

With the aim of secreting human granulocyte macrophage-colony stimulating factor (hGM-CSF) in Escherichia coli, hGM-CSF cDNA was fused in-frame next to the signal sequence of ST toxin (ST-I) of enterotoxigenic E. coli, containing 53 or 19 amino acids of signal peptide. The fused STsig::hGM-CSF coding fragments were inserted into a T7-based expression plasmid. The recombinant plasmids were ...

Molecular Cloning and Characterization of a Lipase from an Indigenous Bacillus pumilus

Cloning and sequencing of a lipase gene from an indigenous Bacillus pumilus, strain F3, revealed an open-reading frame of 648 nucleotides predicted to encode a protein of 215 residues. Sequence analysis showed that F3 lipase contained a signal peptide composed of 34 amino acids with an H domain of 18 residues. A tat-like motif was found in the signal peptide similar to some other Bacillus pumil...

Journal title:

Volume 1, Issue

Pages -

Publication date: 2012